98 research outputs found

    Solving the Ghost-Gluon System of Yang-Mills Theory on GPUs

    We solve the ghost-gluon system of Yang-Mills theory using Graphics Processing Units (GPUs). Working in Landau gauge, we use the Dyson-Schwinger formalism for the mathematical description, as this approach is well suited to benefit directly from the computing power of GPUs. With the help of a Chebyshev expansion for the dressing functions and a subsequent application of a Newton-Raphson method, the non-linear system of coupled integral equations is linearized. The resulting Newton matrix is generated in parallel using OpenMPI and CUDA(TM). Our results show that it is possible to cut down the run time by two orders of magnitude compared to a sequential version of the code. This makes the proposed techniques well suited for Dyson-Schwinger calculations on more complicated systems where the Yang-Mills sector of QCD serves as a starting point. In addition, the computation of Schwinger functions using GPU devices is studied.
    Comment: 19 pages, 7 figures, additional figure added, dependence on block-size is investigated in more detail, version accepted by CP

    Solving Lattice QCD systems of equations using mixed precision solvers on GPUs

    Modern graphics hardware is designed for highly parallel numerical tasks and promises significant cost and performance benefits for many scientific applications. One such application is lattice quantum chromodynamics (lattice QCD), where the main computational challenge is to efficiently solve the discretized Dirac equation in the presence of an SU(3) gauge field. Using NVIDIA's CUDA platform we have implemented a Wilson-Dirac sparse matrix-vector product that performs at up to 40 Gflops, 135 Gflops and 212 Gflops for double, single and half precision, respectively, on NVIDIA's GeForce GTX 280 GPU. We have developed a new mixed precision approach for Krylov solvers using reliable updates, which allows for full double precision accuracy while using only single or half precision arithmetic for the bulk of the computation. The resulting BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations until convergence, perform better than the usual defect-correction approach for mixed precision.
    Comment: 30 pages, 7 figure
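The "usual defect-correction approach" that the abstract uses as its baseline can be sketched as follows. This is only an illustration under stated assumptions, not the paper's reliable-update scheme and not a Krylov solver. The inner solver is a plain Jacobi sweep on a small dense, diagonally dominant matrix, and single precision is emulated by rounding stored values through `struct`. The names `to_f32`, `jacobi_f32` and `defect_correction` are hypothetical.

```python
import struct

def to_f32(x):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def matvec(A, x):
    """Dense matrix-vector product in double precision."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def jacobi_f32(A, r, sweeps=20):
    """Approximately solve A e = r with Jacobi sweeps in emulated low precision.

    Stored values are rounded to single precision; intermediate sums are still
    accumulated in double, so this only approximates true float32 arithmetic.
    """
    n = len(r)
    e = [0.0] * n
    for _ in range(sweeps):
        e_new = []
        for i in range(n):
            s = sum(to_f32(A[i][j]) * e[j] for j in range(n) if j != i)
            e_new.append(to_f32((to_f32(r[i]) - s) / to_f32(A[i][i])))
        e = e_new
    return e

def defect_correction(A, b, tol=1e-12, max_outer=50):
    """Outer loop in double precision, inner solves in emulated single precision."""
    x = [0.0] * len(b)
    for outer in range(max_outer):
        # The residual (defect) is always evaluated in double precision.
        r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
        if max(abs(ri) for ri in r) < tol:
            return x, outer
        # Cheap low-precision solve for the correction, accumulated in double.
        e = jacobi_f32(A, r)
        x = [xi + ei for xi, ei in zip(x, e)]
    return x, max_outer

A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
x, outer_iterations = defect_correction(A, b)
```

Each outer step restarts the inner solver from scratch, which is the cost the abstract's reliable-update approach is said to improve on in terms of iterations until convergence.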

    Accelerated Event-by-Event Neutrino Oscillation Reweighting with Matter Effects on a GPU

    Oscillation probability calculations are becoming increasingly CPU-intensive in modern neutrino oscillation analyses. Because each event in a Monte Carlo sample is reweighted independently, the task lends itself to parallel implementation on a Graphics Processing Unit. The library "Prob3++" was ported to the GPU using the CUDA C API, allowing for large-scale parallelized calculations of neutrino oscillation probabilities through matter of constant density and decreasing the execution time by a factor of 75 compared to performance on a single CPU.
    Comment: Final Update: Post submission update Updated version: quantified the difference in event rates for binned and event-by-event reweighting with a typical binning scheme. Improved formatting of reference
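The independence of per-event reweighting can be illustrated with a short sketch. It rests on a strong simplification: the standard two-flavour vacuum survival probability stands in for the three-flavour, constant-density matter calculation that Prob3++ performs. The function names and the `(energy, weight)` event layout are hypothetical and are not part of the Prob3++ API.

```python
import math

def survival_probability(energy_gev, baseline_km, sin2_2theta, dm2_ev2):
    """Two-flavour vacuum survival probability.

    P = 1 - sin^2(2 theta) * sin^2(1.267 * dm^2[eV^2] * L[km] / E[GeV])
    """
    phase = 1.267 * dm2_ev2 * baseline_km / energy_gev
    return 1.0 - sin2_2theta * math.sin(phase) ** 2

def reweight(events, baseline_km, sin2_2theta, dm2_ev2):
    """Scale each event weight by its own oscillation probability.

    `events` is a list of (energy_gev, weight) pairs. No event depends on any
    other, which is why the loop maps naturally onto one GPU thread per event.
    """
    return [
        w * survival_probability(e, baseline_km, sin2_2theta, dm2_ev2)
        for e, w in events
    ]

events = [(0.6, 1.0), (1.0, 0.5), (2.5, 2.0)]
weights = reweight(events, baseline_km=295.0, sin2_2theta=1.0, dm2_ev2=2.5e-3)
```

The oscillation parameters in the usage lines are illustrative values only.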

    APEnet+: high bandwidth 3D torus direct network for petaflops scale commodity clusters

    We describe herein the APElink+ board, a PCIe interconnect adapter featuring the latest advances in wire speed and interface technology, plus hardware support for an RDMA programming model and experimental acceleration of GPU networking. This design allows us to build a low-latency, high-bandwidth PC cluster, the APEnet+ network, the new generation of our cost-effective, tens-of-thousands-scalable cluster network architecture. Some test results and a characterization of data transmission on a complete testbench, based on a commercial development card mounting an Altera FPGA, are provided.
    Comment: 6 pages, 7 figures, proceeding of CHEP 2010, Taiwan, October 18-2

    Fine-grained bit-flip protection for relaxation methods

    Resilience is considered a challenging, under-addressed issue that the high performance computing (HPC) community will have to face in order to produce reliable Exascale systems by the beginning of the next decade. As part of a push toward a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. Starting from a plain implementation of the Jacobi iteration, our approach introduces a low-cost component-wise technique that detects bit-flips, rejecting some component updates and turning the initial synchronized solver into an asynchronous iteration. Our experimental study with sparse incomplete factorizations from a collection of real-world applications, and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant implementation and its practical performance.
    This material is based upon work supported in part by the U.S. Department of Energy (Award Number DE-SC-0010042) and NVIDIA. E. S. Quintana-Orti was supported by project CICYT TIN2014-53495-R of MINECO and FEDER.
    Anzt, H.; Dongarra, J.; Quintana Ortí, ES. (2019). Fine-grained bit-flip protection for relaxation methods. Journal of Computational Science. 36:1-11. https://doi.org/10.1016/j.jocs.2016.11.013
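The idea of rejecting suspicious component updates in a Jacobi iteration can be sketched as below. The abstract does not spell out the detection criterion, so the rule used here is an assumption made purely for illustration: an update is rejected when it is much larger than the largest update accepted in the last fault-free sweep. The function name `protected_jacobi`, the `growth` and `atol` parameters, and the `fault` injection tuple are all hypothetical.

```python
def protected_jacobi(A, b, iters=100, fault=None, growth=2.0, atol=1e-12):
    """Jacobi iteration that rejects component updates that look corrupted.

    fault: optional (sweep, component, bad_value) tuple simulating a corrupted
    update. The rejection rule is an assumed heuristic, not the paper's.
    """
    n = len(b)
    x = [0.0] * n
    prev_step = float("inf")  # largest accepted update of the last clean sweep
    rejected = 0
    for k in range(iters):
        x_new = list(x)
        step = 0.0
        clean_sweep = True
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            cand = (b[i] - s) / A[i][i]
            if fault is not None and fault[0] == k and fault[1] == i:
                cand = fault[2]  # simulated silent data corruption
            delta = abs(cand - x[i])
            # For a convergent Jacobi iteration the updates shrink from sweep
            # to sweep, so a sudden jump is treated as suspicious.
            if delta > growth * prev_step + atol:
                rejected += 1
                clean_sweep = False
                continue  # keep the old value of x[i] for this sweep
            x_new[i] = cand
            step = max(step, delta)
        x = x_new
        # Tighten the threshold only after a sweep without rejections.
        if clean_sweep and step > 0.0:
            prev_step = step
    return x, rejected

A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
x_faulty, n_rejected = protected_jacobi(A, b, fault=(5, 1, 1.0e6))
```

A rejected component simply carries a stale value into the next sweep, which is the sense in which the synchronized solver becomes an asynchronous iteration. Small corruptions pass this particular check, but the contraction property of the iteration damps them in later sweeps.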

    Simulation of reaction-diffusion processes in three dimensions using CUDA

    Numerical solution of reaction-diffusion equations in three dimensions is one of the most challenging applied mathematical problems. Since these simulations are very time consuming, strategies that reduce CPU time are important topics of research. A general and robust approach is the parallelization of source code. Recently, the technological development of graphics hardware has made it possible to use desktop video cards to solve numerically intensive problems. We present a powerful parallel computing framework to solve reaction-diffusion equations numerically using Graphics Processing Units (GPUs) with CUDA. Four different reaction-diffusion problems were solved: (i) diffusion of a chemically inert compound, (ii) Turing pattern formation, (iii) phase separation in the wake of a moving diffusion front and (iv) air pollution dispersion. Additionally, both the Shared method and the Moving Tiles method were tested. Our results show that the parallel implementation achieves typical speed-ups on the order of 5 to 40 times compared to a single-threaded CPU implementation on a 2.8 GHz desktop computer.
    Comment: 8 figures, 5 table
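Problem (i), diffusion of an inert compound, can be illustrated with one explicit finite-difference time step on a cubic grid. This is a minimal sketch under assumptions that the abstract does not confirm: a forward-Euler, seven-point-stencil discretization with periodic boundaries. It does not reproduce the paper's Shared or Moving Tiles memory strategies, and the function name `diffusion_step` is hypothetical.

```python
def diffusion_step(u, d_coef, dt, h):
    """Advance du/dt = D * laplacian(u) by one explicit step on an n*n*n grid.

    Periodic boundaries are assumed. The scheme is stable only when
    d_coef * dt / h**2 <= 1/6.
    """
    n = len(u)
    alpha = d_coef * dt / (h * h)
    new = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # Seven-point stencil: six nearest neighbours minus 6x centre.
                lap = (u[(i + 1) % n][j][k] + u[(i - 1) % n][j][k]
                       + u[i][(j + 1) % n][k] + u[i][(j - 1) % n][k]
                       + u[i][j][(k + 1) % n] + u[i][j][(k - 1) % n]
                       - 6.0 * u[i][j][k])
                new[i][j][k] = u[i][j][k] + alpha * lap
    return new

n = 4
u0 = [[[0.0] * n for _ in range(n)] for _ in range(n)]
u0[1][1][1] = 1.0  # point source
u1 = diffusion_step(u0, d_coef=1.0, dt=0.1, h=1.0)
```

Every grid point is updated from the previous time level only, so on a GPU each point can be assigned to its own thread. A reaction term would be added to the update of `new[i][j][k]`.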

    Skyline: Interactive In-Editor Computational Performance Profiling for Deep Neural Network Training

    Training a state-of-the-art deep neural network (DNN) is a computationally expensive and time-consuming process, which incentivizes deep learning developers to debug their DNNs for computational performance. However, effectively performing this debugging requires intimate knowledge of the underlying software and hardware systems, something that the typical deep learning developer may not have. To help bridge this gap, we present Skyline: a new interactive tool for DNN training that supports in-editor computational performance profiling, visualization, and debugging. Skyline's key contribution is that it leverages special computational properties of DNN training to provide (i) interactive performance predictions and visualizations, and (ii) directly manipulatable visualizations that, when dragged, mutate the batch size in the code. As an in-editor tool, Skyline allows users to leverage these diagnostic features to debug the performance of their DNNs during development. An exploratory qualitative user study of Skyline produced promising results; all the participants found Skyline to be useful and easy to use.
    Comment: 14 pages, 5 figures. Appears in the proceedings of UIST'2